In [132]:
from IPython.core.display import HTML
HTML("""
<style>
.container { width: 100% !important; padding-left: 1em; padding-right: 2em; }
div.output_stderr { background: #FFA; }
</style>
""")
Out[132]:
Shannon's model of entropy is a weighted sum of the logs of the probabilities of each of the possible outcomes when we make a random selection from a set:
$$ H(t) = - \sum_{i=1}^{l} \left( P(t=i) \times \log_{s}(P(t=i)) \right) $$
where $l$ is the number of possible outcomes (levels) of $t$ and $s$ is the base of the logarithm, usually 2.
The information gain of a descriptive feature is a measure of the reduction in the overall entropy of a prediction task that we obtain by testing on that feature.
Computing information gain involves the following three equations:
$$ H\left(t, \mathcal{D}\right) = - \sum_{l \in levels(t)} \left( P(t=l) \times \log_2(P(t=l)) \right) $$
$$ rem\left(d,\mathcal{D}\right) = \sum_{l \in levels\left(d\right)} \underbrace{ \frac{|\mathcal{D}_{d=l}|}{|\mathcal{D}|}}_{\text{weighting}} \times \underbrace{H\left(t, \mathcal{D}_{d=l}\right)}_{ \substack{ \text{entropy of}\\ \text{partition }\mathcal{D}_{d=l} } } $$
$$ IG\left(d,\mathcal{D}\right) = H\left(t,\mathcal{D}\right) - rem\left(d, \mathcal{D}\right) $$
Let's implement a simplistic ID3 algorithm. We assume that each feature can only take two values, so every split produces two branches: the left branch for the lower value (or false), the right branch for the higher value (or true).
There is no consideration of efficiency, or even of avoiding redundant computations. The algorithm also won't return the actual decision tree; we're just going to use print statements to indicate what the algorithm does.
However, this could be built out into something more useful...
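Before filling in the stubs below, it helps to sanity-check the entropy formula on a couple of toy distributions. This is a throwaway sketch (not part of the ID3 code below), using only the standard library:
In [ ]:
import math

def entropy(probs):
    # Shannon entropy of a discrete distribution, in bits
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # fair coin: 1.0 bit
print(entropy([0.9, 0.1]))   # biased coin: about 0.47 bits
print(entropy([1.0]))        # certain outcome: 0.0 bits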
In [ ]:
### define function information gain IG(d_feature, D_set)
def IG(d_feature, D_set):
    # compute the information gain obtained by splitting D_set on feature d_feature
    # (stub -- one possible implementation is sketched after the myID3 stub below)
    return 0.0
In [ ]:
def splitDataSet(d_feature, D_set):
    # partition D_set on feature d_feature: samples with the lower value (or False)
    # go into the left set, samples with the higher value (or True) into the right set
    Left_set = []
    Right_set = []
    return (Left_set, Right_set)
In [ ]:
def myID3(list_of_features, D_set):
    if len(list_of_features) == 0:
        # no features left to split on -- we're done
        print("leaf:", D_set)
    elif len(D_set) <= 1:
        # at most one sample left in this partition -- we're done
        print("leaf:", D_set)
    else:
        # which feature has the greatest information gain on D_set?
        fmax = max(list_of_features, key=lambda f: IG(f, D_set))
        print("splitting on feature:", fmax)
        (L_set, R_set) = splitDataSet(fmax, D_set)
        # recurse on both partitions with the remaining features
        remfeatures = [f for f in list_of_features if f != fmax]
        myID3(remfeatures, L_set)
        myID3(remfeatures, R_set)
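The stubs above are deliberately left empty. One possible way to fill them in is sketched below; it assumes each sample in D_set is a dict holding boolean feature values plus a 'target' key. The helper entropy_of and the 'target' key are illustrative assumptions, not part of the original notebook:
In [ ]:
from math import log2

def entropy_of(D_set):
    # empirical Shannon entropy of the 'target' values in D_set
    counts = {}
    for sample in D_set:
        counts[sample["target"]] = counts.get(sample["target"], 0) + 1
    n = len(D_set)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def splitDataSet(d_feature, D_set):
    # left partition: feature value is False; right partition: feature value is True
    Left_set = [s for s in D_set if not s[d_feature]]
    Right_set = [s for s in D_set if s[d_feature]]
    return (Left_set, Right_set)

def IG(d_feature, D_set):
    # information gain = entropy of D_set minus the weighted entropy of its partitions
    (L_set, R_set) = splitDataSet(d_feature, D_set)
    rem = sum(len(part) / len(D_set) * entropy_of(part)
              for part in (L_set, R_set) if len(part) > 0)
    return entropy_of(D_set) - rem
With these definitions in place, calling myID3 with a list of feature names and a small hand-made list of such dicts prints the sequence of splits.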
In [ ]:
The impurity (variance) at a node can be calculated using the following equation: $$ var\left(t, \mathcal{D}\right) = \frac{\sum_{i=1}^n \left( t_i - \bar{t}\right)^2}{n-1} $$
We select the feature to split on at a node by choosing the feature that minimizes the weighted variance across the resulting partitions:
$$ d_{best} = \underset{d}{\operatorname{argmin}} \sum_{l \in levels(d)} \underbrace{ \frac{|\mathcal{D}_{d=l}|}{|\mathcal{D}|}}_{\text{weighting}} \times \underbrace{var\left(t, \mathcal{D}_{d=l}\right)}_{ \substack{ \text{variance of}\\ \text{partition }\mathcal{D}_{d=l} } } $$
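As a quick numeric illustration of the weighted-variance criterion (the function name and toy numbers are made up for this sketch; it is unrelated to the scikit-learn examples that follow):
In [ ]:
import numpy as np

def weighted_variance(partitions):
    # partitions: list of 1-D arrays of target values, one array per partition
    n_total = sum(len(p) for p in partitions)
    return sum(len(p) / n_total * np.var(p, ddof=1)   # ddof=1 matches the n-1 denominator above
               for p in partitions if len(p) > 1)

# splitting the targets [1, 2, 8, 9] two different ways
print(weighted_variance([np.array([1.0, 2.0]), np.array([8.0, 9.0])]))   # small -> good split
print(weighted_variance([np.array([1.0, 8.0]), np.array([2.0, 9.0])]))   # large -> poor split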
The scikit-learn package provides an implementation of decision tree classifiers. The following example demonstrates the use of the library on the very simple Iris data set.
This is an excerpt from the online documentation:
As with other classifiers, DecisionTreeClassifier takes as input two arrays: an array X, sparse or dense, of size [n_samples, n_features]
holding the training samples, and an array Y of integer values, size [n_samples], holding the class labels for the training samples.
Note: In this environment we represent each sample as a row, and features are columns. This is not always the case!
In [133]:
from sklearn import tree
X = [[0, 0], [1, 1]]
Y = [0, 1]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)
After being fitted, the model can then be used to predict the class of samples:
In [137]:
clf.predict([[1., 0.], [0, 0], [3, 0]])
Out[137]:
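The per-class probabilities can also be inspected with predict_proba; for a decision tree these are the class fractions in the leaf the sample lands in. A minimal check on the toy model above:
In [ ]:
clf.predict_proba([[1., 0.], [0, 0], [3, 0]])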
DecisionTreeClassifier is capable of both binary (where the labels are [-1, 1]) classification and multiclass (where the labels are [0, ..., K-1]) classification.
Using the Iris dataset, we can construct a tree as follows:
In [138]:
from sklearn.datasets import load_iris
from sklearn import tree
iris = load_iris()
clf = tree.DecisionTreeClassifier()
clf = clf.fit(iris.data, iris.target)
In [139]:
iris.data
Out[139]:
In [140]:
iris.target
Out[140]:
Once trained, we can export the tree in Graphviz format using the export_graphviz
exporter. Below is an example export of a tree trained on the entire iris dataset.
To render the image inline, the pydot package is required.
In [141]:
from sklearn.externals.six import StringIO   # on newer scikit-learn versions use: from io import StringIO
import pydot
from IPython.display import Image
dot_data = StringIO()
tree.export_graphviz(clf, out_file=dot_data,
feature_names=iris.feature_names)
graph = pydot.graph_from_dot_data(dot_data.getvalue())   # newer pydot versions return a list here
graph.write_pdf("iris.pdf")
Image(graph.create_png())
Out[141]:
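If Graphviz/pydot is not available, newer scikit-learn releases (0.21 and later) can draw the tree directly with tree.plot_tree; a minimal sketch reusing the fitted clf:
In [ ]:
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 8))
tree.plot_tree(clf, feature_names=iris.feature_names, filled=True)
plt.show()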
In [142]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
# Parameters
n_classes = 3
plot_colors = "bry"
plot_step = 0.02
# Load data
iris = load_iris()
for pairidx, pair in enumerate([[0, 1], [0, 2], [0, 3],
                                [1, 2], [1, 3], [2, 3]]):
    # We only take the two corresponding features
    X = iris.data[:, pair]
    y = iris.target
    # Shuffle
    idx = np.arange(X.shape[0])
    np.random.seed(13)
    np.random.shuffle(idx)
    X = X[idx]
    y = y[idx]
    # Standardize
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    X = (X - mean) / std
    # Train
    clf = DecisionTreeClassifier().fit(X, y)
    # Plot the decision boundary
    plt.subplot(2, 3, pairidx + 1)
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                         np.arange(y_min, y_max, plot_step))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    cs = plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)
    plt.xlabel(iris.feature_names[pair[0]])
    plt.ylabel(iris.feature_names[pair[1]])
    plt.axis("tight")
    # Plot the training points
    for i, color in zip(range(n_classes), plot_colors):
        idx = np.where(y == i)
        plt.scatter(X[idx, 0], X[idx, 1], c=color, label=iris.target_names[i],
                    cmap=plt.cm.Paired)
    plt.axis("tight")
plt.suptitle("Decision surface of a decision tree using paired features")
plt.legend()
plt.show()
A 1D regression with a decision tree: the decision tree is used to fit a sine curve with additional noisy observations. As a result, it learns a piecewise-constant approximation of the sine curve.
We can see that if the maximum depth of the tree (controlled by the max_depth parameter) is set too high, the decision tree learns overly fine details of the training data and fits the noise, i.e. it overfits.
In [ ]:
# Import the necessary modules and libraries
import numpy as np
from sklearn.tree import DecisionTreeRegressor
import matplotlib.pyplot as plt
%matplotlib inline
# Create a random dataset
rng = np.random.RandomState(1)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel()
y[::5] += 3 * (0.5 - rng.rand(16))
# Fit regression model
regr_1 = DecisionTreeRegressor(max_depth=2)
regr_2 = DecisionTreeRegressor(max_depth=5)
regr_1.fit(X, y)
regr_2.fit(X, y)
# Predict
X_test = np.arange(0.0, 5.0, 0.01)[:, np.newaxis]
y_1 = regr_1.predict(X_test)
y_2 = regr_2.predict(X_test)
# Plot the results
plt.figure()
plt.scatter(X, y, c="k", label="data")
plt.plot(X_test, y_1, c="g", label="max_depth=2", linewidth=2)
plt.plot(X_test, y_2, c="r", label="max_depth=5", linewidth=2)
plt.xlabel("data")
plt.ylabel("target")
plt.title("Decision Tree Regression")
plt.legend()
plt.show()
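To make the overfitting point more concrete, the sketch below (an addition, reusing X, y and DecisionTreeRegressor from the cell above) compares training and held-out mean squared error as max_depth grows; sklearn.model_selection requires scikit-learn 0.18 or later:
In [ ]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
for depth in [2, 5, 10, None]:
    model = DecisionTreeRegressor(max_depth=depth).fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))   # shrinks as depth grows
    val_mse = mean_squared_error(y_val, model.predict(X_val))         # typically bottoms out, then worsens
    print(depth, round(train_mse, 4), round(val_mse, 4))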
In [ ]:
## Another Canned Example
In [ ]:
print(__doc__)  # carried over from the scikit-learn example script
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor
# Create a random dataset
rng = np.random.RandomState(1)
X = np.sort(200 * rng.rand(100, 1) - 100, axis=0)
y = np.array([np.pi * np.sin(X).ravel(), np.pi * np.cos(X).ravel()]).T
y[::5, :] += (0.5 - rng.rand(20, 2))
# Fit regression model
regr_1 = DecisionTreeRegressor(max_depth=2)
regr_2 = DecisionTreeRegressor(max_depth=5)
regr_3 = DecisionTreeRegressor(max_depth=8)
regr_1.fit(X, y)
regr_2.fit(X, y)
regr_3.fit(X, y)
# Predict
X_test = np.arange(-100.0, 100.0, 0.01)[:, np.newaxis]
y_1 = regr_1.predict(X_test)
y_2 = regr_2.predict(X_test)
y_3 = regr_3.predict(X_test)
# Plot the results
plt.figure()
plt.scatter(y[:, 0], y[:, 1], c="k", label="data")
plt.scatter(y_1[:, 0], y_1[:, 1], c="g", label="max_depth=2")
plt.scatter(y_2[:, 0], y_2[:, 1], c="r", label="max_depth=5")
plt.scatter(y_3[:, 0], y_3[:, 1], c="b", label="max_depth=8")
plt.xlim([-6, 6])
plt.ylim([-6, 6])
plt.xlabel("data")
plt.ylabel("target")
plt.title("Multi-output Decision Tree Regression")
plt.legend()
plt.show()
In [ ]:
Here are a couple of data sets from the UCI Machine Learning Repository.
Mushroom is a set with over 8,000 samples of categorical attributes of mushrooms. The classifier's task is to determine whether a mushroom is edible or not.
https://archive.ics.uci.edu/ml/datasets/Mushroom
https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/
Car Evaluation is a set with about 1,700 samples evaluating cars as "unacceptable", "acceptable", "good", or "very good".
https://archive.ics.uci.edu/ml/datasets/Car+Evaluation
https://archive.ics.uci.edu/ml/machine-learning-databases/car/
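The cells below expect a local file car.data. If it is not already present, a download along these lines should work (the file name car.data inside the directory listed above is an assumption worth double-checking against the repository listing):
In [ ]:
import urllib.request

# fetch the raw comma-separated data file into the working directory
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data"
urllib.request.urlretrieve(url, "car.data")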
In [143]:
dataset = []
for l in open("car.data").readlines():
    dataset.append(l[0:-1].split(','))
dataset[0:9]
Out[143]:
Unfortunately, the scikit-learn algorithm needs numerical data, so we have to convert the categorical labels into numbers.
First, we determine the levels of each feature. Then we create a dictionary to convert the labels into numbers.
In [144]:
Nfeature = len(dataset[0])
Nsample = len(dataset)
print("%s x %s" % (Nsample, Nfeature))
In [ ]:
## http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
## LabelEncoder could be used instead of the home-grown conversion below (see the sketch further down)...
In [145]:
levdict = [{} for c in range(0, Nfeature)]
for c in range(0, Nfeature):
    vec = []
    for r in range(0, Nsample):
        vec.append(dataset[r][c])
    levels = list(set(vec))
    for k in range(0, len(levels)):
        levdict[c][levels[k]] = k
levdict
Out[145]:
In [146]:
numdata = [ [ levdict[c][dataset[r][c]] for c in range(Nfeature)] for r in range(Nsample) ]
numdata[0:9]
Out[146]:
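As the note above suggests, scikit-learn's LabelEncoder can do the same column-by-column conversion; a minimal sketch (LabelEncoder numbers the levels in sorted order, so the integer codes may differ from those produced by levdict):
In [ ]:
from sklearn.preprocessing import LabelEncoder

# one encoder per column; fit_transform maps each level of that column to an integer code
columns = [[row[c] for row in dataset] for c in range(Nfeature)]
encoded_cols = [LabelEncoder().fit_transform(col) for col in columns]
# stitch the encoded columns back together into rows
numdata2 = [[int(encoded_cols[c][r]) for c in range(Nfeature)] for r in range(Nsample)]
numdata2[0:9]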
In [147]:
X = [] # attributes
Y = [] # target
for sample in numdata:
    X.append(sample[0:6])   # all six descriptive features (columns 0..5)
    Y.append(sample[6])     # the class label is in the last column
[ X[0:9], Y[0:9] ]
Out[147]:
In [148]:
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)
clf
Out[148]:
In [149]:
car_feature_names = ["buying", "maint", "doors", "persons", "lug_boot", "safety"]
dot_data = StringIO()
tree.export_graphviz(clf, out_file=dot_data,
feature_names=car_feature_names)
graph = pydot.graph_from_dot_data(dot_data.getvalue())
graph.write_pdf("cars.pdf")
###Image(graph.create_png())
Out[149]:
In [150]:
Ncategory = len(levdict[6])
CM = [ [0 for c in range(Ncategory)] for r in range(Ncategory) ]
CM
Out[150]:
In [151]:
Yhat = clf.predict(X)
CM = [ [0 for c in range(Ncategory)] for r in range(Ncategory) ]
for k in range(Nsample):
    CM[Yhat[k]][Y[k]] += 1   # rows: predicted class, columns: actual class
CM
Out[151]:
In [152]:
for c in CM:
    print(c)
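For comparison, scikit-learn's confusion_matrix computes the same counts, but with rows indexed by the actual class and columns by the predicted class, i.e. the transpose of the CM built above:
In [ ]:
from sklearn.metrics import confusion_matrix

confusion_matrix(Y, Yhat)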
In [ ]: